Proposed solution
Terminology and Implementation Considerations


> Block Group: a block present in a container group.


> Each set of data (d) + parity (p) chunks written to the block group completes one stripe.


> Parity is generated at the client.
Write


Masters: node selection, key space management, etc.


EC Block Group : 1


Placement policies pick the required number of nodes for EC.


The node set is chosen based on the EC schema (3(d):2(p), 6:3, 10:4).


After it finishes writing d * block_size of data, the client asks the masters for a new node set.
Write: Striping


[Figure: the client splits the input file into 1MB chunks (chunk1 to chunk6) and EC-encodes them into an EC block group on nodes N1 to N5. Stripe-1 holds c1, c2, c3 plus parity of (c1, c2, c3); Stripe-2 holds c4:chunk4, c5:chunk5, c6:chunk6 plus parity of (c4, c5, c6).]
Stripe: one round of data + parity chunks is called a full stripe.


Chunks are written to the data nodes in round-robin fashion.


Parity generation: after every d data chunks are written, parity is generated and sent to the remaining nodes in the group.


If a stripe write fails, the current block group is closed and the failed stripe is rewritten to a new block group.


The client keeps track of bytes written and checks for failures.
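
The placement rules above can be sketched roughly as follows. This is only an illustration, not the Ozone client: `stripe_layout` is a hypothetical helper, chunks are represented by labels rather than bytes, and parity cells are shown symbolically instead of being computed.

```python
def stripe_layout(num_chunks, d=3, p=2):
    """Return, for each of the d + p nodes, the cells written to it in stripe order."""
    nodes = [[] for _ in range(d + p)]
    for start in range(0, num_chunks, d):
        # One stripe takes up to d consecutive data chunks.
        stripe = [f"c{i + 1}" for i in range(start, min(start + d, num_chunks))]
        # Data chunks go round robin to the first d nodes.
        for node_index, chunk in enumerate(stripe):
            nodes[node_index].append(chunk)
        # After the data chunks of the stripe, parity goes to the remaining p nodes.
        for j in range(p):
            nodes[d + j].append(f"parity{j + 1}({', '.join(stripe)})")
    return nodes


layout = stripe_layout(6)  # six 1MB chunks with the 3:2 schema
for number, cells in enumerate(layout, start=1):
    print(f"N{number}: {cells}")
```

With six chunks and the 3:2 schema, N1 to N3 each receive one data chunk per stripe and N4 and N5 receive the two parity cells of each stripe.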


CoC 2023 NA Oct. 7 — Oct. 10 


Write: Partial Stripe with Padding


The client uses padding data for generating parity chunks if the stripe is not full.


EC Block Group : 1 (nodes N1 to N5)


Padding buffer: all zeros (0000000...)


[Figure: three example files written with 1MB chunks.]

file1.txt (one data chunk): parity1 (c1, 00, 00), parity2 (c1, 00, 00)

file2.txt (two data chunks): parity1 (c1, c2, 00), parity2 (c1, c2, 00)

file3.txt (three data chunks): parity1 (c1, c2, c3), parity2 (c1, c2, c3)


Partial stripe: a missing chunk (for example chunk3) is assumed to be padding data, and its length is 1MB.
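
A small sketch of the padding idea, under loud assumptions: the chunk size is 4 bytes instead of 1MB, and a single XOR parity stands in for the real erasure code, so this shows only how zero-filled padding chunks complete a partial stripe before parity is computed. `pad_stripe` and `xor_parity` are hypothetical names.

```python
CHUNK_SIZE = 4  # bytes; stands in for the 1MB chunk size on the slides


def pad_stripe(chunks, d=3, chunk_size=CHUNK_SIZE):
    """Return exactly d chunks, filling the missing positions with zero chunks."""
    if len(chunks) > d or any(len(c) != chunk_size for c in chunks):
        raise ValueError("expected at most d chunks of chunk_size bytes each")
    return list(chunks) + [bytes(chunk_size)] * (d - len(chunks))


def xor_parity(chunks):
    """Byte-wise XOR of equally sized chunks (a stand-in for the real EC encode)."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, value in enumerate(chunk):
            parity[i] ^= value
    return bytes(parity)


# file1.txt: only c1 is real data, so parity is computed over (c1, 00, 00).
stripe = pad_stripe([b"abcd"])
parity = xor_parity(stripe)
print(stripe, parity)
```

The padding chunks feed only the parity calculation; per the slide, they are assumed data of chunk length.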
Write: Striping


> If a stripe write fails, the current block group is closed.
o The failed stripe is rewritten to the new block group.
> The client keeps track of bytes written and checks for failures.


> After all data writes finish, the parity is written.
Read


[Figure: Read File. The client reads the 1MB data chunks (chunk1 to chunk6) from the data nodes; the parity nodes hold parity1 and parity2 for (c1, c2, c3) and (c4, c5, c6).]


Reads happen in the same order in which the writes were done. The order is based on replica indexes.


The client stitches the data back into the original order and serves it to the user.
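
As a rough illustration of the stitching step (not Ozone's reader), the sketch below assumes each data replica is identified by a 1-based replica index and holds its chunks in stripe order. `stitch` is a hypothetical helper.

```python
def stitch(replicas, d=3):
    """Rebuild the original byte order from per-replica chunk lists.

    replicas maps a 1-based replica index to the chunks stored on that
    node, in stripe order.
    """
    out = bytearray()
    stripes = max(len(replicas.get(index, [])) for index in range(1, d + 1))
    for stripe in range(stripes):
        # Within a stripe, chunks were written in replica-index order.
        for index in range(1, d + 1):
            chunks = replicas.get(index, [])
            if stripe < len(chunks):
                out += chunks[stripe]
    return bytes(out)


replicas = {1: [b"c1", b"c4"], 2: [b"c2", b"c5"], 3: [b"c3", b"c6"]}
data = stitch(replicas)
print(data)
```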
Reconstructional Reads


> A read first attempts to read the data blocks.


> When a node fails during a read, the client switches to a reconstructional read: it reads from parity and reconstructs the lost data transparently.


> A reconstructional read has overhead due to the EC decode operation.


> To avoid degraded reads, we need to recover the lost replicas offline.
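
The fallback from a normal read to a reconstructional read can be sketched as below. This is a simplified model, not the Ozone read path: a single XOR parity replaces the real EC decode, so it can rebuild only one lost chunk per stripe, and `read_stripe` is a hypothetical name.

```python
def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, value in enumerate(chunk):
            out[i] ^= value
    return bytes(out)


def read_stripe(data_cells, parity):
    """Return (chunks, degraded). A None entry marks a chunk on a failed node."""
    missing = [i for i, cell in enumerate(data_cells) if cell is None]
    if not missing:
        return list(data_cells), False  # normal read, no decode needed
    if len(missing) > 1:
        raise IOError("single XOR parity can rebuild only one lost chunk")
    # Reconstructional read: decode the lost chunk from survivors plus parity.
    survivors = [cell for cell in data_cells if cell is not None]
    cells = list(data_cells)
    cells[missing[0]] = xor_chunks(survivors + [parity])
    return cells, True


c1, c2, c3 = b"aaaa", b"bbbb", b"cccc"
parity = xor_chunks([c1, c2, c3])
cells, degraded = read_stripe([c1, None, c3], parity)
print(cells, degraded)
```

The extra decode in the degraded branch is the overhead the slide refers to.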


Recovery Reads


[Figure: Read File with recovery, showing block blk_1_1, Stripe-1 and Stripe-2, the 1MB data chunk c4:chunk4, and parity2 (c4, c5, c6).]
Offline Recovery


What is Offline Recovery?


> When a node/disk is lost, we lose the containers residing on that node/disk.

> We need a mechanism to recover those lost containers in the background.

> We call this process of background recovery "Offline Recovery".

> This is a very critical background task, similar to re-replication on node/disk failures.
Offline Recovery


[Sequence diagram: a coordinator DN drives the recovery between the source DNs and the target DNs.]


ReconstructECContainersCommand
1. containerID
2. Source replicas: DN1, DN2, DN3
3. Target DNs: DN6, DN7
4. Missing indexes
5. ECReplicationConfig


Coordinator DN steps
1. Create recovering-state containers in the target DNs.
2. List blocks and find the block list.
3. Loop over all blocks to recover each block: ReadChunk from the source replicas, EC decode and recover the lost index chunks, then write (transfer) each chunk to the target container.
4. Target DNs send ICR.
5. Close the containers.
6. Container Recovery Done.
Reconstruction Time Corruption Possibilities


Offline Recovery


[Sequence diagram: the coordinator DN receives ReconstructECContainersCommand (containerID; source replicas DN1, DN2, DN3; target DNs DN6, DN7; missing indexes; ECReplicationConfig), creates recovering-state containers in the target DNs, lists blocks, and loops over all blocks: ReadChunk, EC decode and recover lost index chunks, write chunk to the target container. Target DNs send ICR, the containers are closed, and container recovery is done.]


Annotation at the EC decode and write chunk steps: wrongly generated data leads to wrong checksums here.
Offline Recovery


[Sequence diagram: the coordinator DN receives ReconstructECContainersCommand (containerID; source replicas DN1, DN2, DN3; target DNs DN6, DN7; missing indexes; ECReplicationConfig), creates recovering-state containers in the target DNs, lists blocks, and loops over all blocks: ReadChunk, EC decode and recover lost index chunks, write chunk to the target container. Target DNs send ICR, the containers are closed, and container recovery is done.]


Annotation at the EC decode and write chunk steps: How do we detect?


Corruption Detection & Challenges 


Here is one example of a corruption situation we dealt with.


> The content of some blocks turned out to be all zeros.
o Found the issue when comparing against source data checksums.


> Need to find the blocks with all zeros as evidence of corruption:
https://github.com/umamaheswararao/ec-corruption-analyzer


> Need additional confirmation to make sure the corruption is real:
https://github.com/sodonnel/hdfs-ec-validator


> Fixing such corruption is hard: the impacted blocks in a block group need to be deleted.
> The system then fixes them with the right content automatically.


> If fewer than d intact blocks remain in a block group (that is, more than p blocks are impacted), it is not possible to fix automatically unless we have the source data available.
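
A minimal sketch of the first check, assuming a block is available as a local file. It is not taken from either tool linked above, and `is_all_zeros` is a hypothetical helper. An all-zero block is only evidence of corruption, which is why the slide asks for additional confirmation.

```python
def is_all_zeros(path, buffer_size=1024 * 1024):
    """Return True if the file is non-empty and every byte in it is zero."""
    seen_data = False
    with open(path, "rb") as block_file:
        while True:
            buffer = block_file.read(buffer_size)
            if not buffer:
                break
            seen_data = True
            # bytes.count(0) counts zero bytes; any other byte ends the scan.
            if buffer.count(0) != len(buffer):
                return False
    return seen_data
```

Reading in fixed-size buffers keeps memory use flat even for large block files.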
Solution


> Checksum bytes are calculated for the full stripe at the client.


> Checksum bytes are splittable at chunk boundaries. Meaning, when the checksum is 4 bytes per chunk, the full stripe checksum is 5 * 4 bytes.
(Note: in the interest of space, the picture only shows a 2-bit checksum length per chunk.)


> So, we can find the checksum bytes of a specific chunk position from the full stripe checksum bytes.


> We can store these checksum bytes only at P + 1 nodes.


> Why P+1 nodes to store checksums?
o At any point of time, reconstruction needs d nodes.
o So, at least one node of the P+1 will participate in reconstruction.


[Figure: EC Block Group with nodes N1 to N5 and the full stripe checksum bytes.]
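
Both points can be checked with a short sketch. It assumes CRC32 (4 bytes) as the per-chunk checksum, which is an illustrative choice rather than a statement about Ozone's configured checksum type. `stripe_checksum` and `chunk_checksum` are hypothetical names.

```python
import zlib
from itertools import combinations

CHECKSUM_LEN = 4  # bytes per chunk checksum (CRC32, assumed for illustration)


def stripe_checksum(cells):
    """Concatenate the per-chunk checksums of all d + p cells of a stripe."""
    return b"".join(zlib.crc32(cell).to_bytes(CHECKSUM_LEN, "big") for cell in cells)


def chunk_checksum(stripe_sum, position):
    """Slice out the checksum of the chunk at a 0-based position."""
    start = position * CHECKSUM_LEN
    return stripe_sum[start:start + CHECKSUM_LEN]


cells = [b"c1", b"c2", b"c3", b"p1", b"p2"]  # 3 data + 2 parity cells
full = stripe_checksum(cells)
print(len(full))  # 5 chunks * 4 bytes

# Why P + 1 holders are enough: every choice of d source nodes out of d + p
# leaves out only p nodes, so it must include at least one of the p + 1 holders.
d, p = 3, 2
holders = set(range(d - 1, d + p))  # any p + 1 nodes; here N3, N4, N5
always_covered = all(
    holders & set(sources) for sources in combinations(range(d + p), d)
)
print(always_covered)
```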
Pros


> The getCheckSum API can make use of these precalculated stripe checksums.

> When a node is lost, the recovery node can use these checksum bytes to validate.


Cons


> Little (negligible) overhead of storing these additional stripe checksum bytes at the DNs.
Recovery Time Validation


[Figure: EC Block Group : 1 with nodes N1 to N5 and the full stripe checksum bytes. Recovery source nodes are N2, N3, N5; N5 has the stripe checksums.]


> N2, N3, N5 are used for reconstruction.


> On recovery of chunks, the recovery node can use N5's original stripe checksum to validate the newly generated chunk checksum.


> If it matches, chunk reconstruction is successful.
> If it does not match, the reconstruction is wrong and the recovery fails.
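
A hedged sketch of the validation step, again assuming a 4-byte CRC32 per chunk and a stripe checksum that is the concatenation of the per-chunk checksums. `validate_recovered` is a hypothetical helper, not an Ozone API.

```python
import zlib

CHECKSUM_LEN = 4  # bytes per chunk checksum (CRC32, assumed for illustration)


def crc(chunk):
    return zlib.crc32(chunk).to_bytes(CHECKSUM_LEN, "big")


def validate_recovered(recovered, position, stripe_sum):
    """Compare a rebuilt chunk's checksum with the stored slice for its position."""
    start = position * CHECKSUM_LEN
    expected = stripe_sum[start:start + CHECKSUM_LEN]
    return crc(recovered) == expected


# The stripe checksum was written by the client and is read back from a
# surviving holder such as N5.
cells = [b"c1", b"c2", b"c3", b"p1", b"p2"]
stripe_sum = b"".join(crc(cell) for cell in cells)

ok = validate_recovered(b"c1", 0, stripe_sum)       # correct rebuild of N1
bad = validate_recovered(bytes(2), 0, stripe_sum)   # all-zero rebuild
print(ok, bad)
```

A failed comparison is the point at which recovery should be aborted instead of closing the target container with wrong data.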


EC supported in a scalable Object Storage: Apache Ozone


> GitHub repo: https://github.com/apache/ozone
> Looking to contribute to the Apache Ozone EC project?
o Start with https://github.com/apache/ozone/blob/master/CONTRIBUTING.md


> Bug reporting is at: https://issues.apache.org/jira/projects/HDDS
Want to join a Storage Meetup?


> There is a storage meetup happening in Santa Clara on Oct 25.


> Scan the QR code below for details.

